Our final project explores the effects of the COVID-19 pandemic on mental health, specifically in the United States. We attempt to answer the following questions:
How has the progression of the COVID-19 pandemic affected the mental health of people in the United States?
Has the COVID-19 pandemic increased mental health disparities in the United States?
These questions are meaningful to everyone in the United States, as many people have been negatively impacted by the pandemic and would benefit from an increased focus on mental health in research and policy related to recovery from the pandemic. We hope that our findings will help advance knowledge about Americans’ vulnerability to negative mental health outcomes so that susceptible groups can be prioritized in government response and assistance programs related to COVID-19 or future pandemics.
Our team will use a time series data set from the U.S. Census Bureau’s Household Pulse Survey (HPS) to describe indicators of mental health outcomes associated with the COVID-19 pandemic. The HPS was developed to help understand the social and economic impacts of COVID-19 on American households. The time series data set we will use contains measures of anxiety and depression across various demographics in the United States. We will also relate the HPS data to data from the U.S. Census Bureau’s American Community Survey (ACS) to further analyze the proportion of people experiencing a change in mental health. The ACS is one of the most comprehensive sources of population information about the United States, so we will use ACS data to identify other social, economic, housing, or demographic factors associated with changes in mental health during the pandemic. In addition, our team will use web scraping on various social media platforms to see if there is an increase in the frequency of posts regarding anxiety, depression, or loneliness since the start of the pandemic.
Our research will involve investigating several parts:
Part 1: Analyzing Census Data by Demographics
Part 2: Visualizing Change in Mental Health Over Time
Part 3: Webscraping Reddit
We will join together the Household Pulse Survey data and the American Community Survey data to determine whether economic and social factors are associated with mental health outcomes (anxiety and depression during the pandemic).
The most recent ACS data was collected in 2020. According to the U.S. Census Bureau, the 2020 ACS 1-year experimental tables use an experimental estimation methodology and should not be compared with other ACS data. Therefore, we will only look at median household income and internet access in 2020. Since we are using statewide 2020 ACS data, we can only join it with statewide HPS data collected in 2020.
We will analyze the data by comparing disparities in indicator values for anxiety or depression between certain demographics. In particular, we will investigate how differences in age, ethnicity, education, states, access to computers and internet, and median household income may affect indicator values for anxiety or depression.
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
### Load the HPS data on mental health:
hps = pd.read_csv("Indicators_of_Anxiety_or_Depression_Based_on_Reported_Frequency_of_Symptoms_During_Last_7_Days.csv")
# Rename the phase 3 rows to remove text:
hps.Phase = hps.Phase.apply(lambda x: '3' if '3' in x else x)
# Filter the HPS data to data collected in 2020:
# 2020 data is in all time periods in phases 1 and 2 and in time periods 18 - 21 in phase 3.
# Data collection info obtained from https://www.census.gov/data/experimental-data-products/household-pulse-survey.html
hps_2020 = hps.query("Phase == '1' | Phase == '2' | `Time Period` in [18, 19, 20, 21]")
hps_2020 = hps_2020.query("Group == 'By State'")
### Tidy the ACS data:
# Load the median household income (mhhi) data:
mhhi = pd.read_csv("XK201902.csv")
# Remove margin of error cols and estimate label row:
mhhi_error = mhhi.columns[mhhi.columns.str.startswith('Unnamed')]
mhhi.drop(mhhi_error, axis = 1, inplace = True)
mhhi.drop(index = 0, inplace = True)
# Pivot longer with melt:
mhhi = mhhi.melt(var_name = "State", value_name = "mhhi")
# Drop the first and second rows (label and US mhhi):
mhhi.drop(mhhi.index[:2], inplace = True)
mhhi.reset_index(drop = True, inplace = True)
# Change mhhi to numeric:
mhhi["mhhi"] = mhhi["mhhi"].astype(int)
### Tidy the Internet Access Data
# Load the access to computer & internet data:
internet = pd.read_csv("XK202801.csv")
# Remove margin of error cols and estimate label row:
internet_error = internet.columns[internet.columns.str.startswith('Unnamed')]
internet.drop(internet_error, axis = 1, inplace = True)
internet.drop(index = 0, inplace = True)
# Pivot longer to get state column:
internet = internet.melt(id_vars = "\n\nLabel", var_name = "State", value_name = "households")
# Drop rows associated with the entire US (first 6 rows):
internet.drop(internet.index[:6], inplace = True)
internet.reset_index(drop = True, inplace = True)
# Pivot wider to get internet/computer access as columns:
internet = internet.pivot(index = "State", columns = "\n\nLabel", values = "households")
internet.reset_index(inplace = True)
# list(internet.columns)
internet.rename(columns = {" With a broadband Internet subscription": "computer_with_broadband_internet",
" With dial-up Internet subscription alone": "computer_with_dial_up_internet",
" Without an Internet subscription": "computer_without_internet",
" Has a computer:": "total_computer",
" No computer": "total_no_computer",
"Total:": "total_households"},
inplace = True)
# Change access columns to numeric:
internet_access_cols = list(internet.columns[1:])
internet[internet_access_cols] = internet[internet_access_cols].apply(pd.to_numeric)
internet
# Calculate the proportions of households in each state with access to computer/internet:
internet["prop_no_computer"] = internet["total_no_computer"].div(internet["total_households"])
internet["prop_computer"] = internet["total_computer"].div(internet["total_households"])
internet["comp_with_internet"] = internet["computer_with_broadband_internet"] + internet["computer_with_dial_up_internet"]
internet["prop_computer_internet"] = internet["comp_with_internet"].div(internet["total_computer"])
internet["prop_computer_no_internet"] = internet["computer_without_internet"].div(internet["total_computer"])
# Select and rename proportion columns:
internet = internet[["State", "prop_computer_internet", "prop_computer_no_internet", "prop_no_computer",]]
internet.rename(columns = {"prop_no_computer": "no_computer",
"prop_computer_internet": "computer_with_internet",
"prop_computer_no_internet": "computer_without_internet"},
inplace = True)
# Join the 2020 HPS data with the 2020 mhhi data:
joined_2020 = pd.merge(hps_2020, mhhi, how = "outer", on = "State")
# Select variables of interest:
# Value = percentage of adults who report symptoms of anxiety or depression that have been shown to be associated
# with diagnoses of generalized anxiety disorder or major depressive disorder.
joined_2020 = joined_2020[["Indicator", "State", "Time Period", "Time Period Label", "Value", "mhhi"]]
# Join the computer/internet access data with the HPS and mhhi data:
joined_2020 = pd.merge(joined_2020, internet, how = "outer", on = "State")
joined_2020
| | Indicator | State | Time Period | Time Period Label | Value | mhhi | computer_with_internet | computer_without_internet | no_computer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Symptoms of Depressive Disorder | Alabama | 1 | Apr 23 - May 5, 2020 | 18.6 | 53956 | 0.914707 | 0.085293 | 0.086134 |
| 1 | Symptoms of Anxiety Disorder | Alabama | 1 | Apr 23 - May 5, 2020 | 25.6 | 53956 | 0.914707 | 0.085293 | 0.086134 |
| 2 | Symptoms of Anxiety Disorder or Depressive Dis... | Alabama | 1 | Apr 23 - May 5, 2020 | 30.3 | 53956 | 0.914707 | 0.085293 | 0.086134 |
| 3 | Symptoms of Depressive Disorder | Alabama | 2 | May 7 - May 12, 2020 | 22.5 | 53956 | 0.914707 | 0.085293 | 0.086134 |
| 4 | Symptoms of Anxiety Disorder | Alabama | 2 | May 7 - May 12, 2020 | 27.2 | 53956 | 0.914707 | 0.085293 | 0.086134 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3208 | Symptoms of Anxiety Disorder | Wyoming | 20 | Nov 25 - Dec 7, 2020 | 38.7 | 66432 | 0.926483 | 0.073517 | 0.055207 |
| 3209 | Symptoms of Anxiety Disorder or Depressive Dis... | Wyoming | 20 | Nov 25 - Dec 7, 2020 | 44.8 | 66432 | 0.926483 | 0.073517 | 0.055207 |
| 3210 | Symptoms of Depressive Disorder | Wyoming | 21 | Dec 9 - Dec 21, 2020 | 24.7 | 66432 | 0.926483 | 0.073517 | 0.055207 |
| 3211 | Symptoms of Anxiety Disorder | Wyoming | 21 | Dec 9 - Dec 21, 2020 | 33.6 | 66432 | 0.926483 | 0.073517 | 0.055207 |
| 3212 | Symptoms of Anxiety Disorder or Depressive Dis... | Wyoming | 21 | Dec 9 - Dec 21, 2020 | 39.0 | 66432 | 0.926483 | 0.073517 | 0.055207 |
3213 rows × 9 columns
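One way to sanity-check a join like this is pandas' merge `indicator` flag, which labels each row by which frame(s) it came from, so unmatched state names stand out immediately after an outer join. A minimal sketch on toy stand-ins for `hps_2020` and `mhhi` (the values are hypothetical):

```python
import pandas as pd

# Toy frames standing in for hps_2020 and mhhi (hypothetical values):
hps_toy = pd.DataFrame({"State": ["Alabama", "Wyoming"], "Value": [18.6, 24.7]})
mhhi_toy = pd.DataFrame({"State": ["Alabama", "Wyoming"], "mhhi": [53956, 66432]})

# indicator=True adds a _merge column showing where each row came from:
check = pd.merge(hps_toy, mhhi_toy, how="outer", on="State", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(len(unmatched))  # prints 0 when every state matched in both frames
```

With the real frames, any nonzero count would flag state names that spell differently in the HPS and ACS files before they silently become NaNs.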
We will begin visualizing results from the joined data set based on age, ethnicity, and education level.
# HPS by Age:
hps_by_age = hps.query("Group == 'By Age'")
both_by_age = hps_by_age.query("Indicator == 'Symptoms of Anxiety Disorder or Depressive Disorder'")
both_by_age = both_by_age.groupby("Subgroup", as_index = False)["Value"].mean()
# average frequency of depression or anxiety by age
sns.barplot(data = both_by_age, x = "Value", y = "Subgroup",
orient = "h").set(title = "Mean Indicator of Depression or Anxiety Based on Age",
xlabel = "Mean Indicator of Depression or Anxiety",
ylabel = "Age")
According to the barplot above, age appears to be associated with the indicator value, i.e. the reported level of depression or anxiety. The youngest age bin has the highest mean value, the oldest age bin has the lowest, and the mean value decreases as the age bin increases.
# HPS by race:
hps_by_race = hps.query("Group == 'By Race/Hispanic ethnicity'")
both_by_race = hps_by_race.query("Indicator == 'Symptoms of Anxiety Disorder or Depressive Disorder'")
both_by_race = both_by_race.groupby("Subgroup", as_index = False)["Value"].mean()
# Average frequency of depression or anxiety by race
sns.barplot(data = both_by_race, x = "Value", y = "Subgroup",
orient = "h").set(title = "Mean Indicator of Depression or Anxiety Based on Ethnicity",
xlabel = "Mean Indicator of Depression or Anxiety",
ylabel = "Ethnicity")
According to the barplot above, Non-Hispanic Asians have the lowest mean value while Non-Hispanic (other races and multiple races) have the highest mean value.
# HPS by education:
hps_by_education = hps.query("Group == 'By Education'")
both_by_education = hps_by_education.query("Indicator == 'Symptoms of Anxiety Disorder or Depressive Disorder'")
both_by_education = both_by_education.groupby("Subgroup", as_index = False)["Value"].mean()
# Average frequency of depression or anxiety by education level
sns.barplot(data = both_by_education, x = "Value", y = "Subgroup",
orient = "h").set(title = "Mean Indicator of Depression or Anxiety Based on Education Level",
xlabel = "Mean Indicator of Depression or Anxiety",
ylabel = "Education Level")
# inverse relationship b/t education and frequency of symptoms
According to the barplot above, a lower level of education is generally associated with higher indicator values for depression or anxiety. Interestingly, those who have some college or an Associate's degree have a higher mean value than those with a high school diploma or GED. However, those with less than a high school diploma have the highest mean indicator value overall, and those with a Bachelor's degree or higher have the lowest.
import plotly.express as px
# Statewide reported frequency of anxiety or depression for all HPS time periods:
# (to use for the map)
hps_state_avgs = hps.query("Indicator == 'Symptoms of Anxiety Disorder or Depressive Disorder'")
hps_state_avgs = hps_state_avgs.query("Group == 'By State'")
hps_state_avgs = hps_state_avgs.groupby("State")["Value"].mean().to_frame()
abbreviation = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS",
"KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC",
"ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
hps_state_avgs["abbreviation"] = abbreviation
hps_state_avgs.reset_index(inplace = True)
hps_state_avgs = hps_state_avgs[["State", "abbreviation", "Value"]]
fig = px.choropleth(hps_state_avgs, locations = "abbreviation", locationmode = "USA-states",
color = "Value", scope = "usa", labels = {'Value': 'Percent', 'abbreviation': 'State'},
title = "Average percentage of HPS respondents who reported symptoms of anxiety or depression <br>"
"between April 23, 2020 and January 10, 2022",
color_continuous_scale = "Darkmint")
fig.show()
# higher avgs in south and west states
The highest percentage of respondents who reported symptoms of anxiety or depression typically came from southern and western states.
# Create data tables for 2020 depression and anxiety
depression_2020 = joined_2020.query("Indicator == 'Symptoms of Depressive Disorder'")
depression_2020 = depression_2020.groupby("State").mean(numeric_only = True)
#depression_2020.to_csv("depression_by_state_2020.csv")
anxiety_2020 = joined_2020.query("Indicator == 'Symptoms of Anxiety Disorder'")
anxiety_2020 = anxiety_2020.groupby("State").mean(numeric_only = True)
#anxiety_2020.to_csv("anxiety_by_state_2020.csv")
depression_or_anxiety_2020 = joined_2020.query("Indicator == 'Symptoms of Anxiety Disorder or Depressive Disorder'")
depression_or_anxiety_2020 = depression_or_anxiety_2020.groupby("State").mean(numeric_only = True)
#depression_or_anxiety_2020.to_csv("depression_or_anxiety_by_state_2020.csv")
# Deleted time period column in all tables in excel
sns.regplot(x="computer_with_internet",
y="Value",
data=depression_or_anxiety_2020).set(title = "Scatterplot Between Internet Access and Indicator of Anxiety or Depression",
xlabel = "Proportion of People with Computer and Internet Access",
ylabel = "Indicator Value for Depression or Anxiety")
According to the scatterplot above, the proportion of people with computer and internet access appears to have a negative relationship with the indicator value for depression or anxiety. That is, states where a higher proportion of people have computer and internet access tend to report lower indicator values; this is an association, not evidence of causation.
sns.regplot(x="mhhi",
y="Value",
data=depression_or_anxiety_2020).set(title = "Scatterplot Between Median Household Income and Indicator of Anxiety or Depression",
xlabel = "Median Household Income",
ylabel = "Indicator Value for Depression or Anxiety")
According to the scatterplot above, there seems to be no real correlation between median household income and the indicator value for depression or anxiety.
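The visual impression can be backed with a correlation coefficient; `Series.corr` computes Pearson's r by default. A minimal sketch on toy numbers (not our survey data):

```python
import pandas as pd

# Toy values standing in for statewide mhhi and indicator averages:
toy = pd.DataFrame({"mhhi": [50000, 60000, 70000, 80000],
                    "Value": [30.1, 28.4, 31.0, 29.2]})
r = toy["mhhi"].corr(toy["Value"])  # Pearson's r by default
print(round(r, 3))
```

With the real data this would be `depression_or_anxiety_2020["mhhi"].corr(depression_or_anxiety_2020["Value"])`; a value near zero would support the "no real correlation" reading.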
Before analyzing our data set, we would expect that certain demographics would have higher indicator values for depression or anxiety. For example, we hypothesized that older age groups would experience higher indicator values for depression or anxiety as they are the most susceptible to COVID-19. We also hypothesized that Asian populations would experience higher levels of depression or anxiety due to a rise in anti-Asian sentiment following the COVID-19 outbreak. And we expected that those with higher median household incomes would experience far lower levels of depression or anxiety than those with lower incomes.
After analyzing the data set, we found that our expectations were not met. According to our visualizations, younger groups experienced the highest levels of depression or anxiety while the oldest groups experienced the lowest. We also found that Asian populations in fact had the lowest levels of depression or anxiety. Lastly, median household income had little to no correlation with levels of depression or anxiety.
To address one of our key questions, "Has the COVID-19 pandemic increased mental health disparities in the United States?", we will create time series graphs for various demographics over the course of the pandemic to investigate whether disparities among these demographics have increased. In particular, we will look at disparities by age, ethnicity, and education from April 2020 to December 2020. The time period corresponds to the number of weeks since April 23, 2020.
We will investigate the biggest disparities that we've found in part 1. That is: ages 18-29 versus ages 80+, Non-Hispanic Asians versus Non-Hispanic (other races and multiple races), and less than a high school diploma versus bachelor's degree or higher.
# by age faceted
sns.relplot(
data=hps_by_age,
x="Time Period", y="Value",
hue="Subgroup", col = "Subgroup",
kind="line", ci = None,
height=5, aspect=.75, facet_kws=dict(sharex=False),
col_wrap = 3
)
According to the time series plots above, levels of depression or anxiety for ages 18 to 29 have gone up, while levels for ages 80 and above have gone down.
# by race faceted
sns.relplot(
data=hps_by_race,
x="Time Period", y="Value",
hue="Subgroup", col = "Subgroup",
kind="line", ci = None,
height=5, aspect=.75, facet_kws=dict(sharex=False),
col_wrap = 2
)
According to the time series plots above, levels of depression or anxiety for Non-Hispanic Asians have dropped significantly while levels for Non-Hispanic (other races and multiple races) have slightly decreased.
# by education faceted
sns.relplot(
data=hps_by_education,
x="Time Period", y="Value",
hue="Subgroup", col = "Subgroup",
kind="line", ci = None,
height=5, aspect=.75, facet_kws=dict(sharex=False),
col_wrap = 2
)
According to the time series plots above, levels of anxiety or depression for those with less than a high school diploma have dropped significantly, while levels for those with a bachelor's degree or higher have only slightly decreased.
While investigating disparities over time among demographics such as age, ethnicity, and education, we found that while groups varied in how their levels of anxiety or depression changed, all demographics experienced a large increase in levels around the 20th week of our time period, followed by a steep drop around the 30th week.
We found that the disparities by age and ethnicity have widened over the course of the pandemic; however, the gap in levels of anxiety or depression between education levels has decreased.
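A widening gap can be quantified directly by differencing subgroup means at each time period. A minimal sketch with made-up values for the two age subgroups (not the actual survey numbers):

```python
import pandas as pd

# Hypothetical subgroup means at the start and end of the study window:
gaps = pd.DataFrame({
    "Time Period": [1, 21],
    "18 - 29": [40.0, 45.0],
    "80 and above": [15.0, 12.0],
})
gaps["gap"] = gaps["18 - 29"] - gaps["80 and above"]
print(gaps["gap"].tolist())  # → [25.0, 33.0]
```

A gap that grows from the first to the last time period is what we mean by a widening disparity.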
In Part 1, we found that the age demographic that experienced the highest levels of depression or anxiety was young adults aged 18-29. Since the main age demographic of Reddit users consists of young adults, we will use web scraping of anxiety and depression subreddits to determine whether young adults used Reddit as a platform to discuss mental health during the pandemic.
First, we will investigate the frequency of the top 1000 Reddit posts in anxiety and depression subreddits to see whether there was an increase in posts during the pandemic. Then, we will create a word cloud of the comments from those posts to investigate common stressors and see whether they are associated with the pandemic.
# Reddit Scraping
import praw
from praw.models import MoreComments
# Essentials
import numpy as np
from datetime import datetime
# Visualizations
from wordcloud import WordCloud, STOPWORDS
import seaborn as sns
import matplotlib.pyplot as plt
# function for converting a post's unix time to a time format we can work with
def get_date(post):
time = post.created
# time is in UNIX time by default
time_values = datetime.fromtimestamp(time)
readable_time = f"{time_values:%Y-%m-%d}"
return readable_time
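The conversion that get_date performs can be checked against a fixed timestamp; 1587643200 (a hypothetical `post.created` value) is 2020-04-23 12:00 UTC, so formatting it should give a YYYY-MM-DD string:

```python
from datetime import datetime

ts = 1587643200  # hypothetical post.created value: 2020-04-23 12:00 UTC
readable = f"{datetime.fromtimestamp(ts):%Y-%m-%d}"
print(readable)
```

Note that `datetime.fromtimestamp` uses the local timezone, so the exact day printed can shift by one depending on where the notebook runs.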
# Creates an application for webscraping reddit
reddit = praw.Reddit(client_id = "<CLIENT_ID>",          # credentials redacted
                     client_secret = "<CLIENT_SECRET>",
user_agent = "Web Scraping")
# Grabs the 1000 most popular reddit posts in the subreddit "r/depression"
r_depression_posts = []
for post in reddit.subreddit("depression").top(time_filter = "all", limit = 1000):
r_depression_posts.append(post)
# Grabs the 1000 most popular reddit posts in the subreddit "r/anxiety"
r_anxiety_posts = []
for post in reddit.subreddit("anxiety").top(time_filter = "all", limit = 1000):
r_anxiety_posts.append(post)
# Puts the r/depression posts into a dataframe with relevant information
r_depression_post_info = []
for post in r_depression_posts:
r_depression_post_info.append([post.title,
post.score,
post.selftext,
post.num_comments,
get_date(post)])
r_depression_df = pd.DataFrame(r_depression_post_info,
columns=["Title","Score","Post Body","Comments","Date"])
# Puts the r/anxiety posts into a dataframe with relevant information
r_anxiety_post_info = []
for post in r_anxiety_posts:
r_anxiety_post_info.append([post.title,
post.score,
post.selftext,
post.num_comments,
get_date(post)])
r_anxiety_df = pd.DataFrame(r_anxiety_post_info,
columns=["Title","Score","Post Body","Comments","Date"])
# We will see how many of the top 1000 posts of each subreddit come from each year
r_anxiety_df["Year"] = r_anxiety_df["Date"].str[:4]
r_depression_df["Year"] = r_depression_df["Date"].str[:4]
# Gets the unique years and how frequently they occur
depression_labs,depression_counts = np.unique(r_depression_df["Year"],return_counts=True)
depression_year_freq = pd.DataFrame({"Year":depression_labs,
"Frequency":depression_counts})
# Plots the frequency
ax = sns.lineplot(x="Year",y="Frequency",data = depression_year_freq)
ax.set(title = "Frequency of Top 1000 Posts on r/Depression")
According to the graph above, we see that there is a high frequency of posts starting in 2019.
# Gets the unique years and how frequently they occur
anxiety_labs,anxiety_counts = np.unique(r_anxiety_df["Year"],return_counts=True)
anxiety_year_freq = pd.DataFrame({"Year":anxiety_labs,
"Frequency":anxiety_counts})
# Plots the frequency
ax = sns.lineplot(x="Year",y="Frequency",data = anxiety_year_freq)
ax.set(title = "Frequency of Top 1000 Posts on r/Anxiety")
Similarly to the frequency graph for the depression subreddit, we see a high frequency of posts starting in 2019.
# We will have a list of key words to identify posts that are related to covid
covid_key_words = ["covid",
"corona",
"lockdown",
"epidemic"]
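The `word in text` checks below count each keyword at most once per title or body. If we instead wanted every occurrence, and matches only at word starts (so "COVID-19" and "coronavirus" register but mid-word hits do not), a regex over the same keyword list works; `count_mentions` here is a hypothetical helper, not part of our original analysis:

```python
import re

covid_key_words = ["covid", "corona", "lockdown", "epidemic"]

def count_mentions(text, keywords=covid_key_words):
    # Count every occurrence of a keyword at a word start, case-insensitively:
    pattern = r"\b(" + "|".join(keywords) + r")"
    return len(re.findall(pattern, text.lower()))

print(count_mentions("Covid lockdown: the coronavirus epidemic, covid again"))  # → 5
```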
# Count the mentions of covid in the r/depression posts
covid_mentions = [0] * len(r_depression_df)
for i in range(len(r_depression_df)):
count = 0
for word in covid_key_words:
if word in r_depression_df["Post Body"][i].lower():
count +=1
if word in r_depression_df["Title"][i].lower():
count +=1
covid_mentions[i] = count
# Create a new column in the data frame
r_depression_df["Covid Mentions"] = covid_mentions
# Group the covid mentions by date
r_depression_by_date = r_depression_df.groupby("Date")["Covid Mentions"].sum()
# Put the dates and covid mentions in a table
table2 = pd.DataFrame({"Date":pd.to_datetime(r_depression_by_date.index),
"Covid Mentions":list(r_depression_by_date)})
# Create a heat map where the darker the box, the higher the frequency of posts
ax = sns.histplot(x="Date",y="Covid Mentions",data = table2)
ax.set(title = "Heat Map of Covid-19 Related Posts on r/Depression")
# Notice the very high frequency of posts after the outbreak began in December 2019 (U.S. lockdowns began in March 2020)
According to the heat map, while many of the posts don't mention COVID-19 directly, there is still a high frequency of posts after 2019.
# Counts how many times our key words are mentioned in either the title or body of the post
covid_mentions = [0] * len(r_anxiety_df)
for i in range(len(r_anxiety_df)):
count = 0
for word in covid_key_words:
if word in r_anxiety_df["Post Body"][i].lower():
count +=1
if word in r_anxiety_df["Title"][i].lower():
count +=1
covid_mentions[i] = count
# Create a new column in our data frame with how many counts of covid references
r_anxiety_df["Covid Mentions"] = covid_mentions
# Group the posts by date
r_anxiety_by_date = r_anxiety_df.groupby("Date")["Covid Mentions"].sum()
# Put the dates and covid references in a data frame
table = pd.DataFrame({"Date":pd.to_datetime(r_anxiety_by_date.index),
"Covid Mentions":list(r_anxiety_by_date)})
# Create a heatmap where the darker the box, the higher frequency of posts
ax = sns.histplot(x="Date",y="Covid Mentions",data = table)
ax.set(title = "Heat Map of Covid-19 Related Posts on r/Anxiety")
# Notice how although covid isn't mentioned in the majority of posts, there is a much
# higher frequency of posts since the outbreak began in December 2019 (U.S. lockdowns began in March 2020)
Similarly to the heat map of COVID-19 related posts on the depression subreddit, while many of the posts don't mention COVID-19 directly, there is still a high frequency of posts after 2019.
# Grabs the comments from the top 100 posts
depression_comments = []
for post in r_depression_posts[:100]:
for comment in post.comments:
if isinstance(comment,MoreComments):
continue
depression_comments.append(comment.body)
depression_comments = pd.DataFrame(depression_comments,columns=["body"])
# Puts all of the comments into one string
all_depression_comments = ""
for i in depression_comments["body"]:
all_depression_comments += i
all_depression_comments += " "
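The accumulation loop above can be written in one pass with `str.join`, which avoids repeated string copies when there are many comments. A sketch on toy stand-in comments:

```python
import pandas as pd

# Toy comments standing in for the scraped depression_comments frame:
toy_comments = pd.DataFrame({"body": ["first comment", "second comment"]})
all_text = " ".join(toy_comments["body"])
print(all_text)  # → first comment second comment
```

Apart from the trailing space the loop adds, the result is identical for word-cloud purposes.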
# Creates a word cloud out of our comment string
wordcloud = WordCloud(max_words=50,background_color="white").generate(all_depression_comments)
plt.figure(figsize=(15,5))
plt.title("Word Cloud of the Top 100 Posts in r/Depression",size=25)
plt.imshow(wordcloud,interpolation="bilinear")
The graph above is a word cloud of the comments of the top 100 posts on the depression subreddit. We can see that "people", "time", and "work" appear in the cloud, meaning that they are mentioned very frequently.
# Grabs the comments from the top 100 posts
anxiety_comments = []
for post in r_anxiety_posts[:100]:
for comment in post.comments:
if isinstance(comment,MoreComments):
continue
anxiety_comments.append(comment.body)
anxiety_comments = pd.DataFrame(anxiety_comments,columns=["body"])
# Puts all of the comments into one string
all_anxiety_comments = ""
for i in anxiety_comments["body"]:
all_anxiety_comments += i
all_anxiety_comments += " "
# Creates a word cloud out of our comment string
wordcloud = WordCloud(max_words=45,background_color="white").generate(all_anxiety_comments)
plt.figure(figsize=(15,5))
plt.title("Word Cloud of the Top 100 Posts in r/Anxiety",size=25)
plt.imshow(wordcloud,interpolation="bilinear")
The graph above is a word cloud of the comments of the top 100 posts on the anxiety subreddit. We can see that similar to the word cloud of the comments from the depression subreddit, "people", "time", and "work" appear in the cloud, meaning that they are mentioned very frequently.
We hypothesized that there would be an increase in post frequency in depression and anxiety subreddits during the pandemic because young adults make up most of Reddit's user base, and we found in Part 1 that young adults had the highest levels of depression and anxiety.
In our frequency graphs, we found a sharp increase in post frequency starting in 2019, which coincides with the start of the COVID-19 outbreak in December 2019. Furthermore, our heat maps revealed that while only a small fraction of posts directly mentioned COVID-19, almost all of the most popular posts were created during the pandemic.
We found that "people", "time", and "work" appeared in both word clouds of comments from the top 100 posts in the anxiety and depression subreddits. This is significant because these topics were arguably heavily impacted by the pandemic.
The goal of our project was to investigate the impact of the COVID-19 pandemic on mental health in the United States. To accomplish this, we looked at various data sources, including the Household Pulse Survey and American Community Survey from the U.S. Census Bureau, along with the social media site Reddit.
Before analyzing our data set, we hypothesized that certain demographics would experience higher levels of depression or anxiety throughout the pandemic. For example, we expected that older age groups would experience higher indicator values as they are the most susceptible to COVID-19, we expected that Asian populations would experience higher indicator values due to a rise in anti-Asian sentiment following the COVID-19 outbreak, and we expected that those with higher median household incomes would experience far lower levels of depression or anxiety than those with lower incomes.
In Part 1, we found that all of our hypotheses were wrong. It was actually the case that older age groups and Asian populations experienced the lowest levels of depression or anxiety. And we found that there was little to no correlation between household income and levels of depression or anxiety.
One of our key questions was "How has the progression of the COVID-19 pandemic affected the mental health of people in the United States?". In Part 2, we found that while groups experienced different changes in levels of anxiety or depression throughout the pandemic, all demographics experienced a large spike in levels during the 20th week of our time period, which was then followed by a steep drop around the 30th week. Furthermore, we noticed that by the end of the time period, most groups had lower levels of depression and anxiety compared to the first week.
Another one of our key questions was "Has the COVID-19 pandemic increased mental health disparities in the United States?" To answer this question, we looked at the biggest disparities in age, ethnicity, and education and plotted their levels of depression or anxiety over the pandemic. We ultimately found that disparities by age and ethnicity have widened over the course of the pandemic, while the gap in levels of anxiety or depression between education levels has decreased.
After finding that young adults were most susceptible to experiencing high levels of depression or anxiety, we turned to Reddit, whose largest user demographic is young adults, to investigate whether they used the platform to discuss mental health during the pandemic. In particular, we scraped posts from depression and anxiety subreddits. We found that while only a small portion of the most popular posts were COVID-19 related, the vast majority were created during the pandemic. We created word clouds to see what was most commonly discussed within these posts and found that topics such as "people", "work", and "time" were frequently mentioned, all of which were heavily affected by the pandemic.
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
### Load the HPS data on mental health:
hps = pd.read_csv("Indicators_of_Anxiety_or_Depression_Based_on_Reported_Frequency_of_Symptoms_During_Last_7_Days.csv")
# Rename the phase 3 rows to remove text:
hps.Phase = hps.Phase.apply(lambda x: '3' if '3' in x else x)
# Filter the HPS data to data collected in 2020:
# 2020 data is in all time periods in phases 1 and 2 and in time periods 18 - 21 in phase 3.
# Data collection info obtained from https://www.census.gov/data/experimental-data-products/household-pulse-survey.html
hps_2020 = hps.query("Phase == '1' | Phase == '2' | `Time Period` in [18, 19, 20, 21]")
hps_2020 = hps_2020.query("Group == 'By State'")
### Tidy the ACS data:
# Load the median household income (mhhi) data:
mhhi = pd.read_csv("XK201902.csv")
# Remove margin of error cols and estimate label row:
mhhi_error = mhhi.columns[mhhi.columns.str.startswith('Unnamed')]
mhhi.drop(mhhi_error, axis = 1, inplace = True)
mhhi.drop(index = 0, inplace = True)
# Pivot longer with melt:
mhhi = mhhi.melt(var_name = "State", value_name = "mhhi")
# Drop the first and second rows (label and US mhhi):
mhhi.drop(mhhi.index[:2], inplace = True)
mhhi.reset_index(drop = True, inplace = True)
# Change mhhi to numeric:
mhhi["mhhi"] = mhhi["mhhi"].astype(int)
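One caveat on the cast above: it assumes the ACS export stores incomes as bare digits. If a file ever uses thousands separators (e.g. "68,703"), `astype(int)` raises a ValueError. A more defensive conversion, shown here on invented strings, strips commas first:

```python
import pandas as pd

# Toy income strings in the two formats an ACS export might use.
raw = pd.Series(["68703", "68,703"])

# Strip thousands separators before converting; errors="coerce" turns
# anything unparseable into NaN instead of raising.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
print(cleaned.tolist())  # both values parse to 68703
```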
### Tidy the Internet Access Data
# Load the access to computer & internet data:
internet = pd.read_csv("XK202801.csv")
# Remove margin of error cols and estimate label row:
internet_error = internet.columns[internet.columns.str.startswith('Unnamed')]
internet.drop(internet_error, axis = 1, inplace = True)
internet.drop(index = 0, inplace = True)
# Pivot longer to get state column:
internet = internet.melt(id_vars = "\n\nLabel", var_name = "State", value_name = "households")
# Drop columns associated with entire US (first 6 columns):
internet.drop(internet.index[:6], inplace = True)
internet.reset_index(drop = True, inplace = True)
# Pivot wider to get internet/computer access as columns:
internet = internet.pivot(index = "State", columns = "\n\nLabel", values = "households")
internet.reset_index(inplace = True)
# list(internet.columns)
internet.rename(columns = {" With a broadband Internet subscription": "computer_with_broadband_internet",
" With dial-up Internet subscription alone": "computer_with_dial_up_internet",
" Without an Internet subscription": "computer_without_internet",
" Has a computer:": "total_computer",
" No computer": "total_no_computer",
"Total:": "total_households"},
inplace = True)
# Change access columns to numeric:
internet_access_cols = list(internet.columns[1:])
internet[internet_access_cols] = internet[internet_access_cols].apply(pd.to_numeric)
internet
# Calculate the proportions of households in each state with access to computer/internet:
internet["prop_no_computer"] = internet["total_no_computer"].div(internet["total_households"])
internet["prop_computer"] = internet["total_computer"].div(internet["total_households"])
internet["comp_with_internet"] = internet["computer_with_broadband_internet"] + internet["computer_with_dial_up_internet"]
internet["prop_computer_internet"] = internet["comp_with_internet"].div(internet["total_computer"])
internet["prop_computer_no_internet"] = internet["computer_without_internet"].div(internet["total_computer"])
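A quick sanity check on the derived shares (invented counts here, not the real ACS data): households with and without a computer partition the total, so the two proportions should sum to exactly 1.

```python
import pandas as pd

# Invented counts standing in for one state's ACS internet-access row.
toy = pd.DataFrame({"total_households": [1000],
                    "total_computer": [750],
                    "total_no_computer": [250]})
toy["prop_computer"] = toy["total_computer"].div(toy["total_households"])
toy["prop_no_computer"] = toy["total_no_computer"].div(toy["total_households"])

# The two shares partition all households, so they must sum to 1.
total_share = toy["prop_computer"] + toy["prop_no_computer"]
print(total_share.iloc[0])  # 1.0
```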
# Select and rename proportion columns:
internet = internet[["State", "prop_computer_internet", "prop_computer_no_internet", "prop_no_computer",]]
internet.rename(columns = {"prop_no_computer": "no_computer",
"prop_computer_internet": "computer_with_internet",
"prop_computer_no_internet": "computer_without_internet"},
inplace = True)
# Join the 2020 HPS data with the 2020 mhhi data:
joined_2020 = pd.merge(hps_2020, mhhi, how = "outer", on = "State")
# Select variables of interest:
# Value = percentage of adults who report symptoms of anxiety or depression that have been shown to be associated
# with diagnoses of generalized anxiety disorder or major depressive disorder.
joined_2020 = joined_2020[["Indicator", "State", "Time Period", "Time Period Label", "Value", "mhhi"]]
# Join the computer/internet access data with the HPS and mhhi data:
joined_2020 = pd.merge(joined_2020, internet, how = "outer", on = "State")
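Note that `how = "outer"` keeps states that appear in only one of the two tables, filling the missing side with NaN. pandas' `indicator=True` option flags where each row came from, which makes unmatched states easy to audit; a sketch on invented toy tables:

```python
import pandas as pd

# Toy tables with partially overlapping states (values invented).
left = pd.DataFrame({"State": ["Alabama", "Alaska"], "Value": [30.1, 28.5]})
right = pd.DataFrame({"State": ["Alabama", "Arizona"], "mhhi": [52035, 62055]})

# indicator=True adds a _merge column ("both", "left_only", "right_only")
# showing which source each row came from after the outer join.
merged = pd.merge(left, right, how="outer", on="State", indicator=True)
print(merged[["State", "_merge"]])
```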
# HPS by Age:
hps_by_age = hps.query("Group == 'By Age'")
both_by_age = hps_by_age.query("Indicator == 'Symptoms of Anxiety Disorder or Depressive Disorder'")
both_by_age = both_by_age.groupby("Subgroup", as_index=False)["Value"].mean()
# average frequency of depression or anxiety by age
sns.barplot(data = both_by_age, x = "Value", y = "Subgroup",
orient = "h").set(title = "Mean Indicator of Depression or Anxiety Based on Age",
xlabel = "Mean Indicator of Depression or Anxiety",
ylabel = "Age")
# HPS by race:
hps_by_race = hps.query("Group == 'By Race/Hispanic ethnicity'")
both_by_race = hps_by_race.query("Indicator == 'Symptoms of Anxiety Disorder or Depressive Disorder'")
both_by_race = both_by_race.groupby("Subgroup", as_index=False)["Value"].mean()
# Average frequency of depression or anxiety by race
sns.barplot(data = both_by_race, x = "Value", y = "Subgroup",
orient = "h").set(title = "Mean Indicator of Depression or Anxiety Based on Ethnicity",
xlabel = "Mean Indicator of Depression or Anxiety",
ylabel = "Ethnicity")
# HPS by education:
hps_by_education = hps.query("Group == 'By Education'")
both_by_education = hps_by_education.query("Indicator == 'Symptoms of Anxiety Disorder or Depressive Disorder'")
both_by_education = both_by_education.groupby("Subgroup", as_index=False)["Value"].mean()
# Average frequency of depression or anxiety by education
sns.barplot(data = both_by_education, x = "Value", y = "Subgroup",
orient = "h").set(title = "Mean Indicator of Depression or Anxiety Based on Education Level",
xlabel = "Mean Indicator of Depression or Anxiety",
ylabel = "Education Level")
# inverse relationship b/t education and frequency of symptoms
# Statewide reported frequency of anxiety or depression for all HPS time periods:
# (to use for the map)
hps_state_avgs = hps.query("Indicator == 'Symptoms of Anxiety Disorder or Depressive Disorder'")
hps_state_avgs = hps_state_avgs.query("Group == 'By State'")
hps_state_avgs = hps_state_avgs.groupby("State")[["Value"]].mean()
abbreviation = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA", "KS",
"KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY", "NC",
"ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
hps_state_avgs["abbreviation"] = abbreviation
hps_state_avgs.reset_index(inplace = True)
hps_state_avgs = hps_state_avgs[["State", "abbreviation", "Value"]]
fig = px.choropleth(hps_state_avgs, locations = "abbreviation", locationmode = "USA-states",
color = "Value", scope = "usa", labels = {'Value': 'Percent', 'abbreviation': 'State'},
title = "Average percentage of HPS respondents who reported symptoms of anxiety or depression <br>"
"between April 23, 2020 and January 10, 2022",
color_continuous_scale = "Darkmint")
fig.show()
# higher avgs in south and west states
# Create data tables for 2020 depression and anxiety
depression_2020 = joined_2020.query("Indicator == 'Symptoms of Depressive Disorder'")
depression_2020 = depression_2020.groupby("State").mean(numeric_only = True)
#depression_2020.to_csv("depression_by_state_2020.csv")
anxiety_2020 = joined_2020.query("Indicator == 'Symptoms of Anxiety Disorder'")
anxiety_2020 = anxiety_2020.groupby("State").mean(numeric_only = True)
#anxiety_2020.to_csv("anxiety_by_state_2020.csv")
depression_or_anxiety_2020 = joined_2020.query("Indicator == 'Symptoms of Anxiety Disorder or Depressive Disorder'")
depression_or_anxiety_2020 = depression_or_anxiety_2020.groupby("State").mean(numeric_only = True)
#depression_or_anxiety_2020.to_csv("depression_or_anxiety_by_state_2020.csv")
# Deleted time period column in all tables in excel
sns.regplot(x="computer_with_internet",
            y="Value",
            data=depression_2020).set(title = "Scatterplot Between Internet Access and Indicator of Depression",
                                      xlabel = "Proportion of People with Computer and Internet Access",
                                      ylabel = "Indicator Value for Depression")
sns.regplot(x="mhhi",
y="Value",
data=depression_or_anxiety_2020).set(title = "Scatterplot Between Median Household Income and Indicator of Anxiety or Depression",
xlabel = "Median Household Income",
ylabel = "Indicator Value for Depression or Anxiety")
# by age faceted
sns.relplot(
data=hps_by_age,
x="Time Period", y="Value",
hue="Subgroup", col = "Subgroup",
kind="line", ci = None,
height=5, aspect=.75, facet_kws=dict(sharex=False),
col_wrap = 3
)
# by race faceted
sns.relplot(
data=hps_by_race,
x="Time Period", y="Value",
hue="Subgroup", col = "Subgroup",
kind="line", ci = None,
height=5, aspect=.75, facet_kws=dict(sharex=False),
col_wrap = 2
)
# by education faceted
sns.relplot(
data=hps_by_education,
x="Time Period", y="Value",
hue="Subgroup", col = "Subgroup",
kind="line", ci = None,
height=5, aspect=.75, facet_kws=dict(sharex=False),
col_wrap = 2
)
# Reddit Scraping
import praw
from praw.models import MoreComments
# Essentials
import numpy as np
from datetime import datetime
# Visualizations
from wordcloud import WordCloud, STOPWORDS
import seaborn as sns
import matplotlib.pyplot as plt
# function for converting a post's unix time to a time format we can work with
def get_date(post):
    time = post.created
    # time is in UNIX time by default
    time_values = datetime.fromtimestamp(time)
    readable_time = f"{time_values:%Y-%m-%d}"
    return readable_time
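A quick check of the conversion: `datetime.fromtimestamp` interprets the UNIX time in the machine's local timezone, so the formatted date can shift by a day for posts created near midnight UTC. Passing an explicit `tz` pins the output (the timestamp below is chosen for illustration):

```python
from datetime import datetime, timezone

ts = 1609459200  # 2021-01-01 00:00:00 UTC
# An explicit timezone makes the formatted date reproducible across machines;
# without it, fromtimestamp falls back to the local timezone.
readable = f"{datetime.fromtimestamp(ts, tz=timezone.utc):%Y-%m-%d}"
print(readable)  # 2021-01-01
```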
# Creates an application for scraping Reddit
# (credentials redacted; substitute your own app's client id and secret)
reddit = praw.Reddit(client_id = "YOUR_CLIENT_ID",
                     client_secret = "YOUR_CLIENT_SECRET",
                     user_agent = "Web Scraping")
# Grabs the 1000 most popular reddit posts in the subreddit "r/depression"
r_depression_posts = []
for post in reddit.subreddit("depression").top(time_filter="all", limit=1000):
    r_depression_posts.append(post)
# Grabs the 1000 most popular reddit posts in the subreddit "r/anxiety"
r_anxiety_posts = []
for post in reddit.subreddit("anxiety").top(time_filter="all", limit=1000):
    r_anxiety_posts.append(post)
# Puts the r/depression posts into a dataframe with relevant information
r_depression_post_info = []
for post in r_depression_posts:
    r_depression_post_info.append([post.title,
                                   post.score,
                                   post.selftext,
                                   post.num_comments,
                                   get_date(post)])
r_depression_df = pd.DataFrame(r_depression_post_info,
                               columns=["Title", "Score", "Post Body", "Comments", "Date"])
# Puts the r/anxiety posts into a dataframe with relevant information
r_anxiety_post_info = []
for post in r_anxiety_posts:
    r_anxiety_post_info.append([post.title,
                                post.score,
                                post.selftext,
                                post.num_comments,
                                get_date(post)])
r_anxiety_df = pd.DataFrame(r_anxiety_post_info,
                            columns=["Title", "Score", "Post Body", "Comments", "Date"])
# We will see how many of the top 1000 posts of each subreddit come from each year;
# the year is the first four characters of the YYYY-MM-DD date string.
r_anxiety_df["Year"] = r_anxiety_df["Date"].str[:4]
r_depression_df["Year"] = r_depression_df["Date"].str[:4]
# Gets the unique years and how frequently they occur
depression_labs, depression_counts = np.unique(r_depression_df["Year"], return_counts=True)
depression_year_freq = pd.DataFrame({"Year": depression_labs,
                                     "Frequency": depression_counts})
# Plots the frequency
ax = sns.lineplot(x="Year", y="Frequency", data=depression_year_freq)
ax.set(title = "Frequency of Top 1000 Posts on r/Depression")
# Gets the unique years and how frequently they occur
anxiety_labs,anxiety_counts = np.unique(r_anxiety_df["Year"],return_counts=True)
anxiety_year_freq = pd.DataFrame({"Year":anxiety_labs,
"Frequency":anxiety_counts})
# Plots the frequency
ax = sns.lineplot(x="Year",y="Frequency",data = anxiety_year_freq)
ax.set(title = "Frequency of Top 1000 Posts on r/Anxiety")
# We will have a list of key words to identify posts that are related to covid
covid_key_words = ["covid",
"corona",
"lockdown",
"epidemic"]
# Count the mentions of covid in the r/depression posts
covid_mentions = [0] * len(r_depression_df)
for i in range(len(r_depression_df)):
    count = 0
    for word in covid_key_words:
        if word in r_depression_df["Post Body"][i].lower():
            count += 1
        if word in r_depression_df["Title"][i].lower():
            count += 1
    covid_mentions[i] = count
# Create a new column in the data frame
r_depression_df["Covid Mentions"] = covid_mentions
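The counting loop above can be factored into a small helper for reuse on both subreddits; the example post below is invented for illustration. Note that the check is a plain substring match, so "corona" also matches "coronavirus":

```python
covid_key_words = ["covid", "corona", "lockdown", "epidemic"]

def count_covid_mentions(title, body, keywords=covid_key_words):
    """Count keyword hits in a post's title and body (substring match)."""
    count = 0
    for word in keywords:
        if word in body.lower():
            count += 1
        if word in title.lower():
            count += 1
    return count

# Invented example post: "lockdown" hits in the title, "corona" in the body.
print(count_covid_mentions("Anxious during lockdown",
                           "The coronavirus has made everything harder."))  # 2
```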
# Group the covid mentions by date
r_depression_by_date = r_depression_df.groupby("Date")["Covid Mentions"].sum()
# Put the dates and covid mentions in a table
table2 = pd.DataFrame({"Date":pd.to_datetime(r_depression_by_date.index),
"Covid Mentions":list(r_depression_by_date)})
# Create a heat map where the darker the box, the higher the frequency of posts
ax = sns.histplot(x="Date",y="Covid Mentions",data = table2)
ax.set(title = "Heat Map of Covid-19 Related Posts on r/Depression")
# Notice the very high frequency of posts after US lockdowns began (March 2020)
# Counts how many times our key words are mentioned in either the title or body of the post
covid_mentions = [0] * len(r_anxiety_df)
for i in range(len(r_anxiety_df)):
    count = 0
    for word in covid_key_words:
        if word in r_anxiety_df["Post Body"][i].lower():
            count += 1
        if word in r_anxiety_df["Title"][i].lower():
            count += 1
    covid_mentions[i] = count
# Create a new column in our data frame with how many counts of covid references
r_anxiety_df["Covid Mentions"] = covid_mentions
# Group the posts by date
r_anxiety_by_date = r_anxiety_df.groupby("Date")["Covid Mentions"].sum()
# Put the dates and covid references in a data frame
table = pd.DataFrame({"Date":pd.to_datetime(r_anxiety_by_date.index),
"Covid Mentions":list(r_anxiety_by_date)})
# Create a heatmap where the darker the box, the higher frequency of posts
ax = sns.histplot(x="Date",y="Covid Mentions",data = table)
ax.set(title = "Heat Map of Covid-19 Related Posts on r/Anxiety")
# Notice how although covid isn't mentioned in the majority of posts,
# there is a much higher frequency of posts since US lockdowns began (March 2020)
# Grabs the comments from the top 100 posts
depression_comments = []
for post in r_depression_posts[:100]:
    for comment in post.comments:
        if isinstance(comment, MoreComments):
            continue
        depression_comments.append(comment.body)
depression_comments = pd.DataFrame(depression_comments,columns=["body"])
# Puts all of the comments into one string
all_depression_comments = " ".join(depression_comments["body"])
# Creates a word cloud out of our comment string
wordcloud = WordCloud(max_words=50,background_color="white").generate(all_depression_comments)
plt.figure(figsize=(15,5))
plt.title("Word Cloud of the Top 100 Posts in r/Depression",size=25)
plt.imshow(wordcloud,interpolation="bilinear")
# Grabs the comments from the top 100 posts
anxiety_comments = []
for post in r_anxiety_posts[:100]:
    for comment in post.comments:
        if isinstance(comment, MoreComments):
            continue
        anxiety_comments.append(comment.body)
anxiety_comments = pd.DataFrame(anxiety_comments,columns=["body"])
# Puts all of the comments into one string
all_anxiety_comments = " ".join(anxiety_comments["body"])
# Creates a word cloud out of our comment string
wordcloud = WordCloud(max_words=45,background_color="white").generate(all_anxiety_comments)
plt.figure(figsize=(15,5))
plt.title("Word Cloud of the Top 100 Posts in r/Anxiety",size=25)
plt.imshow(wordcloud,interpolation="bilinear")